{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "136a4efe-fb99-4311-8679-e0a5b6282755",
   "metadata": {},
   "source": [
    "<table style=\"width:100%\">\n",
    "<tr>\n",
    "<td style=\"vertical-align:middle; text-align:left;\">\n",
    "<font size=\"2\">\n",
    "Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
    "<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
    "<br>汉化的库: <a href=\"https://github.com/GoatCsu/CN-LLMs-from-scratch.git\">https://github.com/GoatCsu/CN-LLMs-from-scratch.git</a>\n",
    "</font>\n",
    "</td>\n",
    "<td style=\"vertical-align:middle; text-align:left;\">\n",
    "<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
    "</td>\n",
    "</tr>\n",
    "</table>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1910a06-e8a3-40ac-8201-ff70615b1ba4",
   "metadata": {
    "tags": []
   },
   "source": [
    "# 使用 OpenAI API 评估指令响应"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a128651b-f326-4232-a994-42f38b7ed520",
   "metadata": {},
   "source": [
    "- 本笔记本使用 OpenAI 的 GPT-4 API 来评估基于指令微调的 LLMs（大型语言模型）生成的响应，这些响应基于包含生成的模型响应的 JSON 格式数据集，例如：\n",
    "\n",
    "\n",
    "```python\n",
    "{\n",
    "    \"instruction\": \"What is the atomic number of helium?\",\n",
    "    \"input\": \"\",\n",
    "    \"output\": \"The atomic number of helium is 2.\",               # <-- 测试集中给出的目标\n",
    "    \"model 1 response\": \"\\nThe atomic number of helium is 2.0.\", # <-- 由第一个 LLM 生成的响应\n",
    "    \"model 2 response\": \"\\nThe atomic number of helium is 3.\"    # <-- 由第二个 LLM 生成的响应\n",
    "},\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "267ba0d1-b884-42df-85bd-0be746fd47a5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# pip install -r requirements-extra.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "63610acc-db94-437f-8d38-e99dca0299cb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "openai version: 1.30.3\n",
      "tqdm version: 4.66.2\n"
     ]
    }
   ],
   "source": [
    "from importlib.metadata import version\n",
    "\n",
    "pkgs = [\"openai\",  # OpenAI API\n",
    "        \"tqdm\",    # 进度条\n",
    "        ]\n",
    "\n",
    "for p in pkgs:\n",
    "    print(f\"{p} version: {version(p)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bcdcb34-ac75-4f4f-9505-3ce0666c42d5",
   "metadata": {},
   "source": [
    "## 测试 OpenAI API"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9558a522-650d-401a-84fc-9fd7b1f39da7",
   "metadata": {},
   "source": [
    "- 首先，让我们测试 OpenAI API 是否正确设置\n",
    "- 如果你还没有账户，可以前往 [https://platform.openai.com/](https://platform.openai.com/) 创建一个\n",
    "- 请注意，你还需要为账户充值，因为 GPT-4 API 并非免费（详情见 [https://platform.openai.com/settings/organization/billing/overview](https://platform.openai.com/settings/organization/billing/overview)）\n",
    "- 使用本笔记本中的代码运行实验并创建约 200 个评估的费用大约为 $0.26（即 26 美分），截至目前为止\n",
    "\n",
    "- 请确保不要将这个密钥与任何人共享\n",
    "- 将该密钥（`\"sk-...\"`）添加到此文件夹中的 `config.json` 文件"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "65b0ba76-1fb1-4306-a7c2-8f3bb637ccdb",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "from openai import OpenAI\n",
    "\n",
    "# 从 JSON 文件中加载 API 密钥\n",
    "# 请确保将 \"sk-...\" 替换为您在 https://platform.openai.com/api-keys 上的实际 API 密钥\n",
    "with open(\"config.json\", \"r\") as config_file:\n",
    "    config = json.load(config_file)\n",
    "    api_key = config[\"OPENAI_API_KEY\"]\n",
    "\n",
    "client = OpenAI(api_key=api_key)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16642a48-1cab-40d2-af08-ab8c2fbf5876",
   "metadata": {},
   "source": [
    "- 首先，让我们通过一个简单的例子来测试 API，确保它按预期工作："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "08e9ef2e-e816-4283-840e-43625791ad33",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'hello world'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def run_chatgpt(prompt, client, model=\"gpt-4-turbo\"):\n",
    "    response = client.chat.completions.create(\n",
    "        model=model,\n",
    "        messages=[{\"role\": \"user\", \"content\": prompt}],\n",
    "        temperature=0.0,\n",
    "        seed=123,\n",
    "    )\n",
    "    return response.choices[0].message.content\n",
    "\n",
    "\n",
    "prompt = f\"Respond with 'hello world' if you got this message.\"\n",
    "run_chatgpt(prompt, client)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "162a4739-6f03-4092-a5c2-f57a0b6a4c4d",
   "metadata": {},
   "source": [
    "## 加载 JSON 条目"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca011a8b-20c5-4101-979e-9b5fccf62f8a",
   "metadata": {},
   "source": [
    "- 在这里，我们假设我们已经将测试数据集和模型的响应保存为一个 JSON 文件，可以通过以下方式加载："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8b2d393a-aa92-4190-9d44-44326a6f699b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of entries: 100\n"
     ]
    }
   ],
   "source": [
    "json_file = \"eval-example-data.json\"\n",
    "\n",
    "with open(json_file, \"r\") as file:\n",
    "    json_data = json.load(file)\n",
    "\n",
    "print(\"Number of entries:\", len(json_data))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6c9751b-59b7-43fe-acc7-14e8daf2fa66",
   "metadata": {},
   "source": [
    "- 该文件的结构如下，其中包含测试数据集中的给定响应（`'output'`）以及两个不同模型的响应（`'model 1 response'` 和 `'model 2 response'`）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "7222fdc0-5684-4f2b-b741-3e341851359e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'instruction': 'Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.',\n",
       " 'input': '',\n",
       " 'output': 'The hypotenuse of the triangle is 10 cm.',\n",
       " 'model 1 response': '\\nThe hypotenuse of the triangle is 3 cm.',\n",
       " 'model 2 response': '\\nThe hypotenuse of the triangle is 12 cm.'}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "json_data[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fcf0331b-6024-4bba-89a9-a088b14a1046",
   "metadata": {},
   "source": [
    "- 以下是一个小的实用函数，用于后续可视化时格式化输入："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "43263cd3-e5fb-4ab5-871e-3ad6e7d21a8c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def format_input(entry):\n",
    "    instruction_text = (\n",
    "        f\"Below is an instruction that describes a task. Write a response that \"\n",
    "        f\"appropriately completes the request.\"\n",
    "        f\"\\n\\n### Instruction:\\n{entry['instruction']}\"\n",
    "    )\n",
    "\n",
    "    input_text = f\"\\n\\n### Input:\\n{entry['input']}\" if entry[\"input\"] else \"\"\n",
    "    instruction_text + input_text\n",
    "\n",
    "    return instruction_text + input_text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39a55283-7d51-4136-ba60-f799d49f4098",
   "metadata": {},
   "source": [
    "- 现在，让我们尝试使用OpenAI API来比较模型的响应（我们只评估前5个响应进行可视化比较）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "735cc089-d127-480a-b39d-0782581f0c41",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Dataset response:\n",
      ">> The hypotenuse of the triangle is 10 cm.\n",
      "\n",
      "Model response:\n",
      ">> \n",
      "The hypotenuse of the triangle is 3 cm.\n",
      "\n",
      "Score:\n",
      ">> The model response \"The hypotenuse of the triangle is 3 cm.\" is incorrect. The correct calculation of the hypotenuse for a right triangle with legs of 6 cm and 8 cm can be found using the Pythagorean theorem, which states that the square of the hypotenuse (c) is equal to the sum of the squares of the other two sides (a and b). Mathematically, this is expressed as:\n",
      "\n",
      "\\[ c = \\sqrt{a^2 + b^2} \\]\n",
      "\\[ c = \\sqrt{6^2 + 8^2} \\]\n",
      "\\[ c = \\sqrt{36 + 64} \\]\n",
      "\\[ c = \\sqrt{100} \\]\n",
      "\\[ c = 10 \\text{ cm} \\]\n",
      "\n",
      "The correct answer should be 10 cm. The response given as 3 cm is not only incorrect but also significantly off from the correct value. This error could lead to misunderstandings or incorrect applications in practical scenarios where precise measurements are crucial.\n",
      "\n",
      "Given the scale from 0 to 100, where 100 is the best score, the response would score very low due to its inaccuracy. However, since the response format is correct (stating the measurement and unit), it does not score the absolute minimum.\n",
      "\n",
      "**Score: 10/100**\n",
      "\n",
      "This score reflects that while the format of the response is correct, the content is highly inaccurate.\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Dataset response:\n",
      ">> 1. Squirrel\n",
      "2. Eagle\n",
      "3. Tiger\n",
      "\n",
      "Model response:\n",
      ">> \n",
      "1. Squirrel\n",
      "2. Tiger\n",
      "3. Eagle\n",
      "4. Cobra\n",
      "5. Tiger\n",
      "6. Cobra\n",
      "\n",
      "Score:\n",
      ">> The model response lists six animals, three of which (squirrel, tiger, eagle) are indeed active during the day, making them correct responses to the instruction. However, the instruction specifically asked for three different animals, and the model response includes repetitions (tiger and cobra are each listed twice) and also exceeds the requested number of animals.\n",
      "\n",
      "The inclusion of \"cobra\" is incorrect as most cobras are not diurnal (active during the day); they are generally more active during the early morning and late evening, which can be considered crepuscular rather than diurnal.\n",
      "\n",
      "### Scoring Breakdown:\n",
      "- **Relevance to the task**: The response correctly identifies three diurnal animals but also includes additional animals, which was not requested.\n",
      "- **Accuracy**: Including animals not active during the day (cobra) and repeating animals reduces the accuracy.\n",
      "- **Adherence to instructions**: The task was to name three different animals, but the response included six names with repetitions.\n",
      "\n",
      "Given these points, the response partially meets the requirements but also deviates significantly in terms of the number of animals and the inclusion of incorrect and repeated entries.\n",
      "\n",
      "### Score: 50/100\n",
      "This score reflects that while the response did include three correct animals, it failed to strictly follow the instructions by listing only three different animals and included incorrect information.\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Dataset response:\n",
      ">> I must ascertain what is incorrect.\n",
      "\n",
      "Model response:\n",
      ">> \n",
      "What is incorrect?\n",
      "\n",
      "Score:\n",
      ">> The model response \"What is incorrect?\" scores low in terms of fulfilling the instruction to rewrite the sentence in a more formal way. The original sentence \"I need to find out what's wrong.\" expresses a personal obligation and a process of discovery, which is not captured in the model response. The model response turns the sentence into a direct question and loses the nuance of needing to discover or investigate the issue.\n",
      "\n",
      "**Score: 20/100**\n",
      "\n",
      "**Reasoning:**\n",
      "- **Formality:** The response is slightly more formal than casual speech but does not elevate the formality significantly or appropriately. It does use \"incorrect\" which is slightly more formal than \"wrong.\"\n",
      "- **Completeness:** The response fails to include the aspect of needing to find out or ascertain, which is a critical part of the original sentence.\n",
      "- **Accuracy:** The response changes the structure and intent by converting it into a direct question, which does not align with the instruction to rewrite the statement while maintaining its original intent.\n",
      "\n",
      "Overall, the response does not adequately meet the requirements of the task as it significantly alters the meaning and omits key elements of the original sentence.\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Dataset response:\n",
      ">> The interjection in the sentence is 'Wow'.\n",
      "\n",
      "Model response:\n",
      ">> \n",
      "The interjection in the sentence is 'Wow'.\n",
      "\n",
      "Score:\n",
      ">> The model response `The interjection in the sentence is 'Wow'.` accurately identifies the interjection in the provided sentence. The response is clear, directly addresses the instruction, and correctly identifies \"Wow\" as the interjection, which is used to express surprise or admiration, fitting the context of the sentence. Therefore, the response is fully correct and meets all the requirements of the task.\n",
      "\n",
      "Score: 100/100\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Dataset response:\n",
      ">> The type of sentence is interrogative.\n",
      "\n",
      "Model response:\n",
      ">> \n",
      "The type of sentence is exclamatory.\n",
      "\n",
      "Score:\n",
      ">> The model response \"The type of sentence is exclamatory.\" is incorrect. The input sentence \"Did you finish the report?\" is clearly an interrogative sentence as it is asking a question, indicated by the question mark at the end and the structure of the sentence.\n",
      "\n",
      "Given the scoring criteria where 100 is the best score and should be awarded to a correct and precise response, the model's response should receive a low score because it incorrectly identifies the type of sentence. An exclamatory sentence typically expresses strong emotion and ends with an exclamation mark, which is not the case here.\n",
      "\n",
      "Therefore, the score for the model response would be 0 out of 100, as it completely misidentifies the type of sentence, providing incorrect information.\n",
      "\n",
      "-------------------------\n"
     ]
    }
   ],
   "source": [
    "for entry in json_data[:5]:\n",
    "    prompt = (f\"Given the input `{format_input(entry)}` \"\n",
    "              f\"and correct output `{entry['output']}`, \"\n",
    "              f\"score the model response `{entry['model 1 response']}`\"\n",
    "              f\" on a scale from 0 to 100, where 100 is the best score. \"\n",
    "              )\n",
    "    print(\"\\nDataset response:\")\n",
    "    print(\">>\", entry['output'])\n",
    "    print(\"\\nModel response:\")\n",
    "    print(\">>\", entry[\"model 1 response\"])\n",
    "    print(\"\\nScore:\")\n",
    "    print(\">>\", run_chatgpt(prompt, client))\n",
    "    print(\"\\n-------------------------\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "142dfaa7-429f-4eb0-b74d-ff327f79547a",
   "metadata": {},
   "source": [
    "- 请注意，响应非常冗长；为了量化哪个模型更好，我们只需要返回分数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "3552bdfb-7511-42ac-a9ec-da672e2a5468",
   "metadata": {},
   "outputs": [],
   "source": [
    "from tqdm import tqdm\n",
    "\n",
    "\n",
    "def generate_model_scores(json_data, json_key, client):\n",
    "    scores = []\n",
    "    for entry in tqdm(json_data, desc=\"Scoring entries\"):\n",
    "        prompt = (\n",
    "            f\"Given the input `{format_input(entry)}` \"\n",
    "            f\"and correct output `{entry['output']}`, \"\n",
    "            f\"score the model response `{entry[json_key]}`\"\n",
    "            f\" on a scale from 0 to 100, where 100 is the best score. \"\n",
    "            f\"Respond with the number only.\"\n",
    "        )\n",
    "        score = run_chatgpt(prompt, client)\n",
    "        try:\n",
    "            scores.append(int(score))\n",
    "        except ValueError:\n",
    "            continue\n",
    "\n",
    "    return scores"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "71974dea-31ed-49af-abba-5c858bbbf49c",
   "metadata": {},
   "source": [
    "- 请注意，由于 OpenAI 的 GPT 模型即使设置了随机数种子等，仍然是非确定性的，因此响应分数可能会有所不同。\n",
    "\n",
    "- 我们将把这种评估应用于整个数据集，并计算每个模型的平均分数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "4f700d4b-19e5-4404-afa7-b0f093024232",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Scoring entries: 100%|████████████████████████| 100/100 [01:03<00:00,  1.56it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "model 1 response\n",
      "Number of scores: 100 of 100\n",
      "Average score: 74.09\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Scoring entries: 100%|████████████████████████| 100/100 [01:06<00:00,  1.50it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "model 2 response\n",
      "Number of scores: 100 of 100\n",
      "Average score: 56.57\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "for model in (\"model 1 response\", \"model 2 response\"):\n",
    "\n",
    "    scores = generate_model_scores(json_data, model, client)\n",
    "    print(f\"\\n{model}\")\n",
    "    print(f\"Number of scores: {len(scores)} of {len(json_data)}\")\n",
    "    print(f\"Average score: {sum(scores)/len(scores):.2f}\\n\")\n",
    "\n",
    "    # 可选：将分数保存到 JSON 文件中\n",
    "    save_path = Path(\"scores\") / f\"gpt4-{model.replace(' ', '-')}.json\"\n",
    "    with open(save_path, \"w\") as file:\n",
    "        json.dump(scores, file)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8169d534-1fec-43c4-9550-5cb701ff7f05",
   "metadata": {},
   "source": [
    "- 根据上述评估，我们可以得出结论：第一个模型明显优于第二个模型。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
